This file describes the compression algorithm used by my program "Compressor". Compressor converts TXT or AWP files into compressed (what else? :-) files which can be read by my text file displayer/printer "Dogpaw". Dogpaw is available in GEnie's A2 and A2Pro libraries. Compressor and Dogpaw are public domain; use 'em for anything, give 'em to anyone. Included with this file is the full source code to Compressor. This program was only my second-ever assembly language project, so please, no snickering over any lameness you may find in the source. The source is in Orca/M GS format. MAKEBIN must be used after assembly; Compressor is a BINary program, and runs under BASIC.SYSTEM. Thanks to the simplicity of the compression algorithm, only a small portion of the code actually deals with compression/decompression.
I don't know if the compression algorithm described here has an "official" name. I refer to it as "five eighths byte packing", _a la_ "half byte packing"; an algorithm that can be used with data that is confined to a 16-member character set. This algorithm is far less efficient than such compression methods as LZW (ShrinkIt), and it only works with text, but it has the advantages of speed, simplicity, and very small memory requirements.
This algorithm works by compressing only the 31 most commonly used text characters: lower case letters a through z, and a few other characters. Characters that fall into this "pseudo character set" are saved as 5-bit numbers (%11111 = 31). Three of these 5 bit numbers will fit into 2 bytes, with one bit left over. Thus, you save one byte for every three consecutive compressed characters; this gives you a theoretical maximum compression of 33 1/3%.
Characters that aren't in the pseudo character set are simply stored uncompressed. The left over bit at the end of a two-byte "package" of compressed characters is used to flag whether the following byte is the first byte in another two-byte compressed package, or a straight ascii byte. Likewise, the high bit of a straight ascii character is used to flag whether or not the byte that follows _it_ is compressed.
When compressing a file, the algorithm will likely run across an "uncompressible" character when it's in the middle of a two-byte compressed-character package. When this happens, the remainder of the two-byte package is filled with zeros. This is why there are 31 characters in the pseudo-set, rather than 32. Zero is reserve as a null character. Whenever the decompression algorithm sees that the 5 bit number it's extracting is a zero, it simply ignores it and goes on to the next (if any). When it reaches the end of each two-byte package, it checks the final bit to see if the next byte is compressed or not.
At the start of a compressed file, if the first byte is non-zero then it's a straight ascii character. If it's zero, this flags that the first character of the source (pre-compressed) file was compressible, and that bytes number two and three are a two-byte compressed package.

So, a little more graphically, here's how Compressor does it...


5-bit numbers: 1 thru 26: a thru z

other characters:    27      28       29         30        31
                   space   comma  apostrophe   period   <return>

0: null; ignored


Two-byte compressed package:      letter1 letter2 letter3
                                      _|_  __|__  _|_   flag bit
                                     |   ||     ||   | /
                                     11111111  11111111

Flag bit: If clear, next byte is straight ASCII.

Straight ASCII bytes: If high ASCII, next byte is also straight ASCII.

First byte in compressed file: If 0, next two bytes are a two-byte compressed package; if non-zero, then it (the first byte) is uncompressed ASCII.